Ranking temporal data has not been studied until recently, even
though ranking is an important operator (being promoted as a
first-class citizen) in database systems. However, only instant top-k
queries on temporal data have been studied, where objects with the k
highest scores at a query time instance $t$ are to be retrieved. The
instant top-k definition clearly comes with limitations (it is
sensitive to outliers, and a meaningful query time $t$ is difficult to
choose). A more flexible and general ranking operation is to rank
objects based on the aggregation of their scores in a query interval,
which we dub the aggregate top-k query on temporal data. For example,
return the top-10 weather stations having the highest average
temperature from 10/01/2010 to 10/07/2010; find the top-20 stocks
having the largest total transaction volumes from 02/05/2011 to
02/07/2011. This work presents a comprehensive study of this problem
by designing both exact and approximate methods (with approximation
quality guarantees). We also provide theoretical analysis of the
construction cost, the index size, and the update and query costs of
each approach. Extensive experiments on large real datasets clearly
demonstrate the efficiency, the effectiveness, and the scalability of
our methods compared to the baseline methods.

1. INTRODUCTION 

Temporal data has important applications in numerous domains, such as
the financial market, scientific applications, and the biomedical
field. Despite the extensive literature on storing, processing, and
querying temporal data, and the importance of ranking (which is
considered a first-class citizen in database systems [9]), ranking
temporal data has not been studied until recently [15]. However, only
the instant top-k queries on temporal data were studied in [15], where
objects with the k highest scores at a query time instance $t$ are to
be retrieved; it was denoted as the top-$k(t)$ query in [15]. The
instant top-k definition clearly comes with limitations (sensitivity
to outliers, difficulty in choosing a meaningful single query time
$t$). A much more flexible and general ranking operation is to rank
temporal objects based on the aggregation of their scores in a query
interval, which we dub the aggregate top-k query on temporal data, or
top-$k(t_1, t_2, \sigma)$ for an interval $[t_1, t_2]$ and an
aggregation function $\sigma$. For example, return the top-10 weather
stations having the highest average temperature from 10/01/2010 to
10/07/2010; find the top-20 stocks having the largest total
transaction volumes from 02/05/2011 to 02/07/2011.

Figure 1: MesoWest data.

Clearly, the instant top-k query is a special case of the aggregate
top-k query (when $t_1 = t_2$). The work in [15] shows that even the
instant top-k query is hard!

Problem formulation. In temporal data, each object has at least one
score attribute A whose value changes over time, e.g., the temperature
readings in a sensor database. An example of real temperature data
from the MesoWest project appears in Figure 1. In general, we can
represent the score attribute A of an object as an arbitrary function
$f : \mathbb{R} \to \mathbb{R}$ (time to score), but for arbitrary
temporal data, $f$ could be expensive to describe and process. In
practice, applications often approximate $f$ using a piecewise linear
function $g$ [6, 1, 12, 11]. The problem of approximating an arbitrary
function $f$ by a piecewise linear function $g$ has been extensively
studied (see [12, 16, 6, 1] and references therein). Key observations
are: 1) more segments in $g$ lead to better approximation quality, but
are also more expensive to represent; 2) adaptive methods, which
allocate more segments to regions of high volatility and fewer to
smoother regions, are better than non-adaptive methods with a fixed
segmentation interval.

In this paper, for ease of discussion and illustration, we focus on
temporal data represented by piecewise linear functions. Nevertheless,
our results can be extended to other representations of time series
data, as we will discuss in Section 4. Note that a lot of work on
processing temporal data also assumes the use of piecewise linear
functions as the main representation of the temporal data
[6, 1, 12, 11, 14], including the prior work on instant top-k queries
in temporal data [15]. That said, how to approximate $f$ with $g$ is
beyond the scope of this paper, and we assume that the data has
already been converted to a piecewise linear representation by any
segmentation method. In particular, we require neither that the
functions have the same number of segments nor that segments from
different functions have aligned starting/ending time instances. Thus
it is possible that the data is collected from a variety of sources,
each applying a different preprocessing module.

That said, formally, there are $m$ objects in a temporal database;
the $i$th object $o_i$ is represented by a piecewise linear function
$g_i$ with $n_i$ (linear) segments. There are a total of
$N = \sum_{i=1}^{m} n_i$ segments from all objects. The temporal range
of any object is in $[0, T]$. An aggregate top-k query is denoted as
top-$k(t_1, t_2, \sigma)$ for some aggregation function $\sigma$,
which is to retrieve the $k$ objects with the $k$ highest aggregate
scores in the range $[t_1, t_2]$, denoted as an ordered set
$A(k, t_1, t_2)$ (or simply $A$ when the context is clear). The
aggregate score of $o_i$ in $[t_1, t_2]$ is defined as
$\sigma(g_i(t_1, t_2))$, or simply $\sigma_i(t_1, t_2)$, where
$g_i(t_1, t_2)$ denotes the set of all possible values of function
$g_i$ evaluated at every time instance in $[t_1, t_2]$ (clearly an
infinite set for a continuous time domain). For example, when
$\sigma = \mathrm{sum}$, the aggregate score for $o_i$ in
$[t_1, t_2]$ is $\int_{t_1}^{t_2} g_i(t)\,dt$. An example of a sum
top-2 query is shown in Figure 2, and its answer is $\{o_3, o_1\}$.

Figure 2: A top-$2(t_1, t_2, \mathrm{sum})$ query example.

For ease of illustration, we assume non-negative scores by default.
This restriction is removed in Section 4. We also assume a maximum
possible value $k_{\max}$ for $k$.

Our contributions. A straightforward observation is that a solution to
the instant top-k query cannot be directly applied to solve the
aggregate top-k query since: 1) the temporal dimension can be
continuous; and 2) an object might not be in the top-k set for any
top-$k(t)$ query for $t \in [t_1, t_2]$, but still belong to
$A(k, t_1, t_2)$ (for example, $A(1, t_2, t_3)$ in Figure 2 is
$\{o_1\}$, even though $o_1$ is never a top-$1(t)$ object for any
$t \in [t_2, t_3]$). Hence, the trivial solution (denoted as EXACT1)
is for each query to compute $\sigma_i(t_1, t_2)$ of every object and
insert them into a priority queue of size $k$, which takes
$O(m(N + \log k))$ time per query and is clearly not scalable for
large datasets (although our implementation slightly improves this
query time as described in Section 2). Our goal is then to design IO
and computation efficient algorithms which can outperform the trivial
solution and work well regardless of whether the data fits in main
memory. A design principle we have followed is to leverage existing
indexing structures whenever possible (so these algorithms can be
easily adopted in practice). Our work focuses specifically on
$\sigma = \mathrm{sum}$, and we make the following contributions:

• We design a novel exact method in Section 2, based on using 
a single interval tree (EXACT3). 

• We present two approximate methods (and several variants) in
Section 3. Each offers an approximation
$\tilde{\sigma}_i(t_1, t_2)$ of the aggregate score
$\sigma_i(t_1, t_2)$ for objects in any query interval. We say
$\tilde{X}$ is an $(\varepsilon, \alpha)$-approximation of $X$ if
$X/\alpha - \varepsilon M \le \tilde{X} \le X + \varepsilon M$ for
user-defined parameters $\alpha \ge 1$, $\varepsilon > 0$, and where
$M = \sum_{i=1}^{m} \sigma_i(0, T)$. Now, for $i \in [1, m]$ and
$[t_1, t_2] \subseteq [0, T]$, the APPX1 method guarantees that
$\tilde{\sigma}_i(t_1, t_2)$ is an $(\varepsilon, 1)$-approximation
of $\sigma_i(t_1, t_2)$, and the APPX2 method guarantees that
$\tilde{\sigma}_i(t_1, t_2)$ is an
$(\varepsilon, 2\log(1/\varepsilon))$-approximation of
$\sigma_i(t_1, t_2)$. We show that an
$(\varepsilon, \alpha)$-approximation of $\sigma_i(t_1, t_2)$ implies
an approximation $\tilde{A}(k, t_1, t_2)$ of $A(k, t_1, t_2)$ such
that the aggregate score of the $j$th ranked ($1 \le j \le k$) object
in $\tilde{A}(k, t_1, t_2)$ is always an
$(\varepsilon, \alpha)$-approximation of the aggregate score of the
$j$th ranked object in $A(k, t_1, t_2)$.

• We extend our results to general functions / for temporal 
data, other possible aggregates, negative scores, and deal with 
updates in Section 4. 

• We show extensive experiments on massive real data sets in
Section 5. The results clearly demonstrate the efficiency,
effectiveness and scalability of our methods compared to the baseline.
Our approximate methods are especially appealing when approximation is
admissible, given their better query costs than exact methods and high
quality approximations.

We survey the related work in Section 6, and conclude in Section 7.
Table 1 summarizes our notations. Figure 3 summarizes the upper bounds
on the preprocessing cost, the index size, the query cost, the update
cost, and the approximation guarantee of all methods.

Symbol | Description
$A(k, t_1, t_2)$ | ordered top-k objects for top-$k(t_1, t_2, \sigma)$.
$\tilde{A}(k, t_1, t_2)$ | an approximation of $A(k, t_1, t_2)$.
$A(j), \tilde{A}(j)$ | the $j$th ranked object in $A$ or $\tilde{A}$.
$B$ | block size.
$\mathcal{B}$ | set of breakpoints ($\mathcal{B}_1$ and $\mathcal{B}_2$ are special cases).
$\mathcal{B}(t)$ | smallest breakpoint in $\mathcal{B}$ larger than $t$.
$g_i$ | the piecewise linear function of $o_i$.
$g_{i,j}$ | the $j$th line segment in $g_i$, $j \in [1, n_i]$.
$g_i(t_1, t_2)$ | the set of all possible values of $g_i$ in $[t_1, t_2]$.
$k_{\max}$ | the maximum $k$ value for user queries.
$\ell(t)$ | the value of a line segment $\ell$ at time instance $t$.
$m$ | total number of objects.
$M$ | $M = \sum_{i=1}^{m} \sigma_i(0, T)$.
$n_i$ | number of line segments in $g_i$.
$n, n_{avg}$ | $\max\{n_1, n_2, \ldots, n_m\}$, $\mathrm{avg}\{n_1, n_2, \ldots, n_m\}$.
$N$ | number of line segments of all objects.
$o_i$ | the $i$th object in the database.
$q_i$ | number of segments in $g_i$ overlapping $[t_1, t_2]$.
$r$ | number of breakpoints in $\mathcal{B}$, bounded by $O(1/\varepsilon)$.
$(t_{i,j}, v_{i,j})$ | $j$th end-point of segments in $g_i$, $j \in [0, n_i]$.
$\sigma_i(t_1, t_2)$ | aggregate score of $o_i$ in an interval $[t_1, t_2]$.
$\tilde{\sigma}_i(t_1, t_2)$ | an approximation of $\sigma_i(t_1, t_2)$.
$[0, T]$ | the temporal domain of all objects.

Table 1: Frequently used notations.

2. EXACT METHODS 

As explained in Section 1, a trivial exact solution EXACT1 is to find
the aggregate score of each object in the query interval and insert
them into a priority queue of size $k$. We can improve this approach
by indexing line segments from all objects with a B+-tree, where the
key for a data entry $e$ is the time instance of the left end-point of
a line segment $\ell$, and the value of $e$ is just $\ell$. Given a
query interval $[t_1, t_2]$, this B+-tree allows us to find all line
segments that contain $t_1$ in $O(\log_B N)$ IOs. A sequential scan
(till $t_2$) can then retrieve all line segments whose temporal
dimensions overlap with $[t_1, t_2]$ (either fully or partially). In
this process, we simply maintain $m$ running sums, one per object in
the database. Suppose the running sum of object $o_i$ is $s_i$ and it
is initialized with the value 0. Given a line segment $\ell$ defined
by $(t_{i,j}, v_{i,j})$ and $(t_{i,j+1}, v_{i,j+1})$ from $o_i$ (see
an example in Figure 4), we define an interval
$I = [t_1, t_2] \cap [t_{i,j}, t_{i,j+1}]$, let
$t_L = \max\{t_1, t_{i,j}\}$ and $t_R = \min\{t_2, t_{i,j+1}\}$, and
update $s_i = s_i + \sigma_i(I)$, where

$$\sigma_i(I) = \begin{cases} 0, & \text{if } t_2 < t_L \text{ or } t_1 > t_R; \\ \frac{1}{2}(t_R - t_L)(\ell(t_R) + \ell(t_L)), & \text{else.} \end{cases} \qquad (1)$$

Note that $\ell(t)$ is the value of the line segment $\ell$ at time
$t$. If we follow the sequential scan process described above, we will
only deal with line segments that do overlap with the temporal range
$[t_1, t_2]$, in which case the increment to $s_i$ corresponds to the
second case in (1). It is essentially an integral of $\ell$ from
$t_L = \max\{t_1, t_{i,j}\}$ to $t_R = \min\{t_2, t_{i,j+1}\}$, i.e.,
$\int_{t_L}^{t_R} \ell(t)\,dt$. This range $[t_L, t_R]$ of $\ell$ also
defines a trapezoid; hence, the integral is equal to the area of this
trapezoid, which yields the formula in (1).
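The per-segment update in (1) and the final priority-queue step can be sketched in memory as follows. This is only an illustration under stated assumptions: the names `segment_sum` and `exact1_topk` are hypothetical, objects are given as sorted vertex lists `[(t, v), ...]`, and a plain loop over all segments stands in for the B+-tree lookup and sequential scan described above.

```python
import heapq

def segment_sum(t1, t2, ta, va, tb, vb):
    """Area under the segment (ta, va)-(tb, vb) clipped to [t1, t2], as in Eq. (1)."""
    tl, tr = max(t1, ta), min(t2, tb)
    if tl >= tr:                      # no overlap (first case of Eq. (1))
        return 0.0

    def ell(t):                       # value of the line segment at time t
        return va + (vb - va) * (t - ta) / (tb - ta)

    return 0.5 * (tr - tl) * (ell(tr) + ell(tl))   # trapezoid area

def exact1_topk(objects, t1, t2, k):
    """objects: dict id -> sorted vertex list [(t, v), ...] (assumed layout)."""
    scores = []
    for oid, pts in objects.items():
        s = 0.0                       # running sum s_i for this object
        for (ta, va), (tb, vb) in zip(pts, pts[1:]):
            s += segment_sum(t1, t2, ta, va, tb, vb)
        scores.append((s, oid))
    # keep the k largest aggregate scores, in descending order
    return [(oid, s) for s, oid in heapq.nlargest(k, scores)]

objs = {"o1": [(0, 0), (2, 2), (4, 0)], "o2": [(0, 1), (4, 1)]}
print(exact1_topk(objs, 1, 3, 2))     # [('o1', 3.0), ('o2', 2.0)]
```

Because every segment of every object is visited, this sketch reflects the linear-scan behavior of the baseline rather than the improved IO cost obtained with the index.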

When we have scanned all line segments up to $t_2$ from the B+-tree,
we stop and assign $\sigma_i(t_1, t_2) = s_i$ for $i = 1$ to $m$.
Finally, we insert $(i, \sigma_i(t_1, t_2))$, for $i = 1$ to $m$, into
a priority queue of size $k$ sorted in the descending order of
$\sigma_i(t_1, t_2)$. The answer $A(k, t_1, t_2)$ is the (ordered)
object ids in this queue when the last pair
$(m, \sigma_m(t_1, t_2))$ has been processed.

Method | Index size | Construction cost | Query cost | Update cost | Approximation
EXACT1 | $O(N/B)$ | $O((N/B)\log_B N)$ | $O(\log_B N + \sum_{i=1}^{m} q_i/B)$ | $O(\log_B N)$ | $(0, 1)$
EXACT2 | $O(N/B)$ | $O(\sum_{i=1}^{m} (n_i/B)\log_B n_i)$ | $O(\sum_{i=1}^{m} \log_B n_i)$ | $O(\log_B n)$ | $(0, 1)$
EXACT3 | $O(N/B)$ | $O((N/B)\log_B N)$ | $O(\log_B N + m/B)$ | $O(\log_B N)$ | $(0, 1)$
APPX1 | $O((r^2/B)\,k_{\max})$ | $O((N/B)(\log_B N + r))$ | $O(k/B + \log_B r)$ | $O((1/B)(\log_B N + r))$ | $(\varepsilon, 1)$
APPX2 | $O((r/B)\,k_{\max})$ | $O((N/B)(\log_B N + \log r))$ | $O(k \log r)$ | $O((1/B)(\log_B N + \log r))$ | $(\varepsilon, 2\log r)$

Figure 3: IO costs, with block size $B$; for simplicity,
$\log_B k_{\max}$ terms are absorbed in $O(\cdot)$ notation.

Figure 4: Compute $\sigma_i([t_1, t_2] \cap [t_{i,j}, t_{i,j+1}])$.

This method EXACT1 has a cost of $O((N/B)\log_B N)$ IOs for building
the B+-tree, an index size of $O(N/B)$ blocks, and a query cost of
$O(\log_B N + \sum_{i=1}^{m} q_i/B + (m/B)\log_B k)$ IOs, where $q_i$
is the number of line segments from $o_i$ overlapping with the
temporal range $[t_1, t_2]$ of a query $q$ = top-$k(t_1, t_2,
\mathrm{sum})$. In the worst case, $q_i = n_i$ for each $i$, and the
query cost becomes $O(N/B)$!

A forest of B+-trees. EXACT1 becomes quite expensive when there are a
lot of line segments in $[t_1, t_2]$, and its asymptotic query cost is
actually $O(N/B)$ IOs, which is clearly non-scalable. The bottleneck
of EXACT1 is the computation of the aggregate score of each object.
One straightforward idea to improve the aggregate score computation is
to leverage precomputed prefix-sums [7]. We apply the notion of
prefix-sums to continuous temporal data by precomputing the aggregate
scores of some selected intervals in each object; this preprocessing
helps reduce the cost of computing the aggregate score for an
arbitrary interval in an object. Let $(t_{i,j}, v_{i,j})$ be the $j$th
end-point of segments in $g_i$, where $j \in \{0, \ldots, n_i\}$;
clearly, the $j$th segment in $g_i$ is then defined by
$((t_{i,j-1}, v_{i,j-1}), (t_{i,j}, v_{i,j}))$ for
$j \in \{1, \ldots, n_i\}$, which we denote as $g_{i,j}$. Then define
intervals $I_{i,\ell} = [t_{i,0}, t_{i,\ell}]$ for
$\ell = 1, \ldots, n_i$, and compute the aggregate score
$\sigma_i(I_{i,\ell})$ for each.

Figure 5: The method EXACT2.

Once we have the pairs $(I_{i,\ell}, \sigma_i(I_{i,\ell}))$, we build
a B+-tree to index them. Specifically, we make a leaf-level data entry
$e_{i,\ell}$ for $(I_{i,\ell}, \sigma_i(I_{i,\ell}))$, where the key
in $e_{i,\ell}$ is $t_{i,\ell}$ (the right end-point of
$I_{i,\ell}$), and the value of $e_{i,\ell}$ includes both
$g_{i,\ell}$ and $\sigma_i(I_{i,\ell})$. Given
$\{e_{i,1}, \ldots, e_{i,n_i}\}$ for $o_i$, we bulk-load a B+-tree
$T_i$ using them as the leaf-level data entries (see Figure 5 for an
example).

We do this for each object, resulting in $m$ B+-trees. Given $T_i$, we
can compute $\sigma_i(t_1, t_2)$ for any interval $[t_1, t_2]$
efficiently. We first find the data entry $e_{i,L}$ such that its key
value $t_{i,L}$ is the first succeeding key value of $t_1$; we then
find the data entry $e_{i,R}$ such that its key value $t_{i,R}$ is the
first succeeding key value of $t_2$. Next, we can calculate
$\sigma_i(t_1, t_{i,L})$ using $g_{i,L}$ (stored in $e_{i,L}$), and
$\sigma_i(t_2, t_{i,R})$ using $g_{i,R}$ (stored in $e_{i,R}$), simply
based on (1). Finally,

$$\sigma_i(t_1, t_2) = \sigma_i(I_{i,R}) - \sigma_i(I_{i,L}) + \sigma_i(t_1, t_{i,L}) - \sigma_i(t_2, t_{i,R}), \qquad (2)$$

where $\sigma_i(I_{i,R})$ and $\sigma_i(I_{i,L})$ are available in
$e_{i,R}$ and $e_{i,L}$ respectively. Figure 5 also gives a query
example using $o_3$.

Once all $\sigma_i(t_1, t_2)$'s are computed for $i = 1, \ldots, m$,
the last step is the same as that in EXACT1.
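The prefix-sum lookup in (2) can be illustrated for a single object with a sorted array standing in for the B+-tree $T_i$. This is a minimal sketch under assumptions: the helper names are hypothetical, the vertex list is held in memory, and the query satisfies $t_{i,0} \le t_1 \le t_2 \le t_{i,n_i}$.

```python
import bisect

def build_prefix(pts):
    """Return (ends, prefix) where prefix[j] is the integral of g over [t_0, t_j]."""
    ends = [t for t, _ in pts]
    prefix = [0.0]
    for (ta, va), (tb, vb) in zip(pts, pts[1:]):
        prefix.append(prefix[-1] + 0.5 * (tb - ta) * (va + vb))
    return ends, prefix

def tail(pts, j, t):
    """sigma(t, t_j): area under segment j (joining pts[j-1], pts[j]) from t to t_j."""
    (ta, va), (tb, vb) = pts[j - 1], pts[j]
    vt = va + (vb - va) * (t - ta) / (tb - ta)   # segment value at time t
    return 0.5 * (tb - t) * (vt + vb)

def range_sum(pts, ends, prefix, t1, t2):
    """Aggregate score over [t1, t2] via Eq. (2): two searches plus O(1) arithmetic."""
    # index of the first end-point key >= t (at least 1, so a segment exists)
    L = max(1, bisect.bisect_left(ends, t1))
    R = max(1, bisect.bisect_left(ends, t2))
    return prefix[R] - prefix[L] + tail(pts, L, t1) - tail(pts, R, t2)

pts = [(0, 0), (2, 2), (4, 0)]
ends, prefix = build_prefix(pts)
print(range_sum(pts, ends, prefix, 1, 3))   # 3.0
```

The binary search plays the role of the root-to-leaf traversal in $T_i$; the number of segments inside the query interval no longer affects the cost of one lookup.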

We denote this method as EXACT2. Finding $e_{i,L}$ and $e_{i,R}$ from
$T_i$ takes only $\log_B n_i$ cost, and calculating (2) takes $O(1)$
time. Hence, its query cost is
$O(\sum_{i=1}^{m} \log_B n_i + (m/B)\log_B k)$ IOs. The index size of
this method is the size of all B+-trees, where $T_i$'s size is linear
in $n_i$, so the total size is $O(N/B)$ blocks. Note that computing
$\{\sigma_i(I_{i,1}), \ldots, \sigma_i(I_{i,n_i})\}$ can be easily
done in $O(n_i/B)$ IOs, by sweeping through the line segments in $g_i$
sequentially from left to right, and using (1) incrementally (i.e.,
computing $\sigma_i(I_{i,\ell+1})$ by initializing its value to
$\sigma_i(I_{i,\ell})$). Hence, the construction cost is dominated by
building each tree $T_i$ with cost $O((n_i/B)\log_B n_i)$. The total
construction cost is $O(\sum_{i=1}^{m} (n_i/B)\log_B n_i)$.

Using one interval tree. When m is large (as is the case for the real 
data sets we explore in Section 5), querying m B+-trees becomes 
very expensive, partly due to the overhead of opening and closing 
m disk files storing these B+-trees. Hence, an important improve- 
ment is to somehow index the data entries from all m B+-trees in a 
single disk-based data structure. 

Consider any object $o_i$, and let intervals
$I_{i,1}, \ldots, I_{i,n_i}$ be the same as in EXACT2, where
$I_{i,\ell} = [t_{i,0}, t_{i,\ell}]$. Furthermore, we define intervals
$I^-_{i,1}, \ldots, I^-_{i,n_i}$ such that
$I^-_{i,\ell} = I_{i,\ell} - I_{i,\ell-1}$ (let
$I_{i,0} = [t_{i,0}, t_{i,0}]$), i.e.,
$I^-_{i,\ell} = [t_{i,\ell-1}, t_{i,\ell}]$.

Next, we define a data entry $e_{i,\ell}$ such that its key is
$I^-_{i,\ell}$, and its value is
$(g_{i,\ell}, \sigma_i(I_{i,\ell}))$, for $\ell = 1, \ldots, n_i$.
Clearly, an object $o_i$ yields $n_i$ such data entries. Figure 6
illustrates an example using the same setup as in Figure 5. When we
collect all such entries from all objects, we end up with $N$ data
entries in total. We denote these data entries as a set $I^-$; it is
interesting to note that the key value of each data entry in $I^-$ is
an interval. Hence, we can index $I^-$ using a disk-based interval
tree $S$ [13, 4, 3].

Figure 6: The method EXACT3.

Given this interval tree $S$, computing $\sigma_i(t_1, t_2)$ can now
be reduced to two stabbing queries, using $t_1$ and $t_2$
respectively, which return the entries in $S$ whose key values
(intervals in $I^-$) contain $t_1$ or $t_2$ respectively. Note that
each such stabbing query returns exactly $m$ entries, one from each
object $o_i$. This is because: 1) any two intervals
$I^-_{i,x}, I^-_{i,y}$ for $x \ne y$ from $o_i$ satisfy
$I^-_{i,x} \cap I^-_{i,y} = \emptyset$; and 2)
$I^-_{i,1} \cup I^-_{i,2} \cup \cdots \cup I^-_{i,n_i} = [0, T]$.

Now, suppose the stabbing query of $t_1$ returns an entry $e_{i,L}$
from $o_i$ in $S$, and the stabbing query of $t_2$ returns an entry
$e_{i,R}$ from $o_i$ in $S$. It is easy to see that we can calculate
$\sigma_i(t_1, t_2)$ just as (2) does in EXACT2 (see Figure 6). Note
that these two stabbing queries alone are sufficient to compute all
$\sigma_i(t_1, t_2)$'s for $i = 1, \ldots, m$.

Given $N$ data entries, the external interval tree has a linear size
of $O(N/B)$ blocks and takes $O((N/B)\log_B N)$ IOs to build [4]
(building entries $\{e_{i,1}, \ldots, e_{i,n_i}\}$ for $o_i$ takes
only $O(n_i/B)$ cost). The two stabbing queries take
$O(\log_B N + m/B)$ IOs [4]; hence, the total query cost, by adding
the cost of inserting the $\sigma_i(t_1, t_2)$'s into a priority queue
of size $k$, is $O(\log_B N + (m/B)\log_B k)$.

Remarks. One technique we do not consider is indexing temporal data
with R-trees to solve aggregate top-k queries. R-trees constructed
over temporal data have been shown to perform orders of magnitude
worse than other indexing techniques for answering instant top-k
queries, even when branch-and-bound methods are used [15]. Given this
fact, we do not attempt to extend the use of R-trees to solve the
harder aggregate top-k query.

Temporal aggregation with range predicates has been studied in 
the classic work [22,21], however, with completely different ob- 
jectives. Firstly, they dealt with multi-versioned keys instead of 
time-series data, i.e., each key is alive with a constant value dur- 
ing a time period before it gets deleted. One can certainly model 
these keys as temporal objects with constant functions following 
our model (or even piecewise constant functions to model also up- 
dates to keys, instead of only insertions and deletions of keys). But 
more importantly, their definitions of the aggregation [22,21] are 
fundamentally different from ours. The goal in [21] is to compute 
the sum of key values alive at a time instance, or alive at a time in- 
terval intersecting a query interval. The work in [22] extends [21] 
by allowing a range predicate on the key dimension as well, i.e., its 
goal is to compute the sum of key values that 1) are alive at a time 
instance, or alive at a time interval intersecting a query interval; 2) 
and are within a specified query range in the key dimension. 

Clearly, these aggregations [22,21] are different from ours. They 
want to compute a single aggregation of all keys that "fall within" 
(are alive in) a two-dimensional query rectangle; while our goal is 
to compute the aggregate score values of many individual objects 
over a time interval (then rank objects based on these aggregations). 

Zhang et al. [22] also extended their investigation to compute the 
sum of weighted key values, where each key value (that is alive in 
a two-dimensional query rectangle) is multiplied by a weight pro- 
portional to how long it is alive on the time dimension within the 
query interval. This weighted key value definition will be the same 
as our aggregation definition if an object's score is a constant in 
the query interval. They also claimed that their solutions can still 
work when the key value is not a constant, but a function with cer- 
tain types of constraints. Nevertheless, even in these cases, their 
goal is to compute a single sum over all weighted key values for 
an arbitrary two-dimensional query rectangle, rather than each in- 
dividual weighted key value over a time interval. Constructing m 
such structures, a separate one for each of the m objects in our 
problem, and only allowing an unbounded key domain can be seen 
as similar to our EXACT2 method, which on large data corpora is
the least efficient technique we consider. These fundamental dif- 
ferences make these works almost irrelevant in providing helpful 
insights for solving our temporal aggregation problems. 



3. APPROXIMATE METHODS 

The exact approaches require explicit computation of
$\sigma_i(t_1, t_2)$ for each of the $m$ objects, and we manage to
reduce the IO cost of this from roughly $N/B$ (EXACT1) to $m$ (EXACT2)
to $m/B$ (EXACT3). Yet, on real data sets where $m$ is quite large,
this can still be infeasible for fast queries. Hence we now study
approximate methods that allow us to remove this requirement of
computing all $m$ aggregates, while still allowing any query
$[t_1, t_2]$ over the continuous time domain.

Our approximate methods focus on constructing a set of breakpoints
$\mathcal{B} = \{b_1, b_2, \ldots, b_r\}$, $b_i \in [0, T]$, in the
time domain, and snapping queries to align with these breakpoints. We
prove that the returned value $\tilde{\sigma}_i(t_1, t_2)$ for any
curve $(\varepsilon, 1)$-approximates $\sigma_i(t_1, t_2)$. The number
of breakpoints and the query time will be independent of the total
number of segments $N$ or objects $m$.

In this section we devise two methods for constructing $r$
breakpoints: BREAKPOINTS1 and BREAKPOINTS2. The first method,
BREAKPOINTS1, guarantees $r = O(1/\varepsilon)$ and is fairly
straightforward to construct. The second method requires more advanced
techniques to construct efficiently and also guarantees
$r = O(1/\varepsilon)$, but $r$ can be much smaller in practice.

Then, given a set of breakpoints, we present two ways to answer
approximate queries on them: QUERY1 and QUERY2. The first approach,
QUERY1, constructs $O(r^2)$ intervals, and uses a two-level B+-tree to
retrieve the associated top-k object list from the one interval
snapped to by the query. The second approach, QUERY2, only builds
$O(r)$ intervals and their associated $k_{\max}$ top objects, and on a
query narrows the list of possible top-k objects to a reduced set of
$O(k \log r)$ objects. Figure 7 shows an outline of these methods.




Figure 7: Outline of approximate methods. 

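We define the following approximation metrics.

Definition 1 $G$ is an $(\varepsilon, \alpha)$-approximation algorithm
of the aggregate scores if for any $i \in [1, m]$ and
$[t_1, t_2] \subseteq [0, T]$, $G$ returns
$\tilde{\sigma}_i(t_1, t_2)$ such that
$\sigma_i(t_1, t_2)/\alpha - \varepsilon M \le
\tilde{\sigma}_i(t_1, t_2) \le \sigma_i(t_1, t_2) + \varepsilon M$,
for user-defined parameters $\alpha \ge 1$, $\varepsilon > 0$.

Definition 2 For $A(k, t_1, t_2)$ (or $\tilde{A}(k, t_1, t_2)$), let
$A(j)$ (or $\tilde{A}(j)$) be the $j$th ranked object in $A$ (or
$\tilde{A}$). $R$ is an $(\varepsilon, \alpha)$-approximation
algorithm of top-$k(t_1, t_2, \sigma)$ queries if for any
$k \in [1, k_{\max}]$ and $[t_1, t_2] \subseteq [0, T]$, $R$ returns
$\tilde{A}(k, t_1, t_2)$ and $\tilde{\sigma}_{\tilde{A}(j)}(t_1, t_2)$
for $j \in [1, k]$, s.t. $\tilde{\sigma}_{\tilde{A}(j)}(t_1, t_2)$ is
an $(\varepsilon, \alpha)$-approximation of
$\sigma_{\tilde{A}(j)}(t_1, t_2)$ and $\sigma_{A(j)}(t_1, t_2)$.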

Definition 2 states that $\tilde{A}$ will be a good approximation of
$A$ if $(\varepsilon, \alpha)$ are small, since at each rank the two
objects from $A$ and $\tilde{A}$ respectively will have very close
aggregate scores. This implies that the exact ranking order in $A$
will be preserved well by $\tilde{A}$ unless many objects have very
close aggregate scores (closer than the gap defined by
$(\varepsilon, \alpha)$) on some query interval; this is unlikely in
real datasets when users choose small values of
$(\varepsilon, \alpha)$.
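The two definitions can be read as simple predicates. The sketch below is illustrative only: the function names are hypothetical, the inequalities are taken as non-strict, and the rank-wise check assumes that Definition 2 compares each returned score against both the true score of the returned object and the true score of the object at the same rank in the exact answer.

```python
def is_approx(approx, exact, eps, alpha, M):
    """Definition 1 as a predicate: exact/alpha - eps*M <= approx <= exact + eps*M."""
    return exact / alpha - eps * M <= approx <= exact + eps * M

def ranking_is_approx(exact_scores, approx_topk, eps, alpha, M):
    """Rank-wise check in the spirit of Definition 2 (assumed reading).

    exact_scores: dict id -> true aggregate score over the query interval.
    approx_topk:  list of (id, approximate score), ordered by rank.
    """
    true_by_rank = sorted(exact_scores.values(), reverse=True)
    return all(
        is_approx(s, exact_scores[oid], eps, alpha, M)   # vs. its own true score
        and is_approx(s, true_by_rank[j], eps, alpha, M)  # vs. true j-th best score
        for j, (oid, s) in enumerate(approx_topk)
    )

exact = {"o1": 10.0, "o2": 9.8, "o3": 2.0}
# o2 and o1 are swapped, but their scores differ by less than the allowed gap.
print(ranking_is_approx(exact, [("o2", 9.9), ("o1", 9.7)], eps=0.05, alpha=1, M=20.0))
```

With $\varepsilon M = 1$ in this toy input, swapping two objects whose true scores differ by 0.2 still satisfies the check, which matches the remark that the order is preserved except among near-ties.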

The Appendix (Section 10) shows that an algorithm $G$ satisfying
Definition 1 implies an algorithm $R$ satisfying Definition 2. That
said, for either BREAKPOINTS1 or BREAKPOINTS2, QUERY1 is an
$(\varepsilon, 1)$-approximation for $\sigma_i(t_1, t_2)$ and
$A(k, t_1, t_2)$; QUERY2 is an $(\varepsilon, 2\log r)$-approximation
for $\sigma_i(t_1, t_2)$ and $A(k, t_1, t_2)$. Despite the reduction
in guaranteed accuracy for QUERY2, in practice its accuracy is not
much worse than that of QUERY1, and it is 1-2 orders of magnitude
better in space and construction time; and QUERY1 improves upon
EXACT3, the best exact method.

3.1 Breakpoints

Our key insight is that $\sigma_i(t_1, t_2)$ does not depend on the
number of segments between the boundary times $t_1$ and $t_2$; it only
depends on the aggregate $\sigma$ applied to that range. So to
approximate the aggregate score of any object within a range, we can
discretize the time domain based on the accumulated $\sigma$ value.
Specifically, we ensure that between no two consecutive breakpoints
$b_j, b_{j+1} \in \mathcal{B}$ does the value
$\sigma_i(b_j, b_{j+1})$ become too large for an object. Both sets of
breakpoints, $\mathcal{B}_1$ for BREAKPOINTS1 and $\mathcal{B}_2$ for
BREAKPOINTS2, start with $b_0 = 0$ and end with $b_r = T$. Given
$b_0$, they sweep forward in time, always constructing $b_j$ before
$b_{j+1}$, and define

$$b_{j+1} \text{ s.t. } \begin{cases} \sum_{i=1}^{m} \sigma_i(b_j, b_{j+1}) = \varepsilon M, & \text{in BREAKPOINTS1}, \\ \max_{i=1}^{m} \sigma_i(b_j, b_{j+1}) = \varepsilon M, & \text{in BREAKPOINTS2}, \end{cases}$$

where $M = \sum_{i=1}^{m} \sigma_i(0, T)$. Note that these breakpoints
$b_j$ are not restricted to, and in general will not occur at, the end
points of segments of any $o_i$.

Since the total aggregate $\sum_{i=1}^{m} \sigma_i(0, T) = M$, for
BREAKPOINTS1 there will be exactly
$r = \lceil 1/\varepsilon + 1 \rceil$ breakpoints, as each (except for
the last $b_r$) accounts for $\varepsilon M$ towards the total
integral. For ease of exposition we will assume that $1/\varepsilon$
is integral and drop the $\lceil \cdot \rceil$ notation, hence
$(1/\varepsilon) \cdot \varepsilon M = M$. Next we notice that
BREAKPOINTS2 will have at most as many breakpoints as BREAKPOINTS1,
since $\max_{i=1}^{m} x_i \le \sum_{i=1}^{m} x_i$ for any set of
$x_i \ge 0$. However, the inequality is not strict and these
quantities could be equal; this implies the two cases could have the
same number of breakpoints. This is restricted to the special case
where between every consecutive pair $b_j, b_{j+1} \in \mathcal{B}$
exactly one object $o_i$ has $\sigma_i(b_j, b_{j+1}) = \varepsilon M$
and every other object $o_{i'}$ for $i' \ne i$ has zero aggregate
$\sigma_{i'}(b_j, b_{j+1}) = 0$. As we will demonstrate on real data
in Section 5, in most reasonable cases the size of BREAKPOINTS2 is
dramatically smaller than the size of BREAKPOINTS1.

Construction of BREAKPOINTS1. We first need to preprocess all of the
objects into individual tuples, one for each vertex between two line
segments. Consider two line segments $s_1$ and $s_2$ that together
span from time $t_L$ to time $t_R$ and transition at time $t_M$. If
they are part of object $o_i$ then they have values
$v_L = g_i(t_L)$, $v_M = g_i(t_M)$, and $v_R = g_i(t_R)$. Then for the
vertex at $(t_M, v_M)$ we store the tuple
$(t_L, t_M, t_R, v_L, v_M, v_R)$. Then we sort all tuples across all
objects according to $t_M$ in ascending order and place them in a
queue $Q$. The breakpoints $\mathcal{B}_1$ will be constructed by
popping elements from $Q$.

We need to maintain some auxiliary information while processing each
tuple. For each tuple, we can compute the slopes of its two adjacent
segments as $w_L = (v_M - v_L)/(t_M - t_L)$ and
$w_R = (v_R - v_M)/(t_R - t_M)$. Between each pair of segment
boundaries the value of an object $g_i(t)$ varies linearly according
to the slope $w_{i,\ell}$ of segment $g_{i,\ell}$. Thus the sum
$\sum_{i=1}^{m} g_i(t)$ varies linearly according to
$W(t) = \sum_{i=1}^{m} w_{i,\ell_i}$, if each $i$th object is
currently represented by segment $g_{i,\ell_i}$. Also, at any time $t$
we can write the summed value as $V(t) = \sum_{i=1}^{m} g_i(t)$. Now
for any two time points $t_1$ and $t_2$ such that no segment starts or
ends in the range $(t_1, t_2)$, and given $V(t_1)$ and $W(t_1)$, we
can calculate in constant time the sum
$\sum_{i=1}^{m} \sigma_i(t_1, t_2) = \frac{1}{2} W(t_1)(t_2 - t_1)^2
+ V(t_1)(t_2 - t_1)$. Thus we always maintain $V(t)$ and $W(t)$ for
the current $t$.

Since $b_0 = 0$, to construct $\mathcal{B}_1$ we only need to show how
to construct $b_{j+1}$ given $b_j$. Starting at $b_j$ we reset to 0 a
running sum up to a time $t \ge b_j$, written
$I(t) = \sum_{i=1}^{m} \sigma_i(b_j, t)$. Then we pop a tuple
$(t_L, t_M, t_R, v_L, v_M, v_R)$ from $Q$ and process it as follows.
We update the running sum to time $t_M$ as
$I(t_M) = I(t) + \frac{1}{2} W(t)(t_M - t)^2 + V(t)(t_M - t)$. If
$I(t_M) < \varepsilon M$, then we update
$V(t_M) = V(t) + W(t)(t_M - t)$, then $W(t_M) = W(t) - w_L + w_R$,
and pop the next tuple off of $Q$.

If $I(t_M) \ge \varepsilon M$, the breakpoint $b_{j+1}$ occurs
somewhere between $t$ and $t_M$. We can solve for this time $b_{j+1}$
in the equation $I(b_{j+1}) = \varepsilon M$, which is the quadratic
$\frac{1}{2} W(t) x^2 + V(t) x + (I(t) - \varepsilon M) = 0$ in
$x = b_{j+1} - t$, as

$$b_{j+1} = t + \frac{-V(t) + \sqrt{V(t)^2 - 2W(t)(I(t) - \varepsilon M)}}{W(t)}.$$

(When $W(t) = 0$ the equation is linear and
$b_{j+1} = t + (\varepsilon M - I(t))/V(t)$.) The slope $W(t)$ has not
changed, but we have to update
$V(b_{j+1}) = V(t) + W(t) \cdot (b_{j+1} - t)$. Now we reinsert the
tuple at the top of $Q$ to begin the process of finding $b_{j+2}$.
Since the $N$ tuples are processed in a single linear scan, the
construction time is dominated by the $O((N/B)\log_B N)$ IOs for
sorting the tuples.
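The sweep for BREAKPOINTS1 can be sketched in memory as below. This is an illustration under assumptions rather than the disk-based procedure: the name `breakpoints1` is hypothetical, every object is a sorted vertex list spanning the same domain $[0, T]$, scores are non-negative with $M > 0$, and slope changes at vertices replace the explicit six-field tuples.

```python
import math

def breakpoints1(objects, eps):
    """objects: list of vertex lists [(t, v), ...], each spanning [0, T]."""
    T = objects[0][-1][0]
    V = W = M = 0.0
    events = []                                   # (vertex time, change in W)
    for pts in objects:
        segs = list(zip(pts, pts[1:]))
        slopes = [(vb - va) / (tb - ta) for (ta, va), (tb, vb) in segs]
        M += sum(0.5 * (tb - ta) * (va + vb) for (ta, va), (tb, vb) in segs)
        V += pts[0][1]                            # V(0): summed value at time 0
        W += slopes[0]                            # W(0): summed slope at time 0
        for (tm, _), wl, wr in zip(pts[1:-1], slopes, slopes[1:]):
            events.append((tm, wr - wl))          # W changes by w_R - w_L at t_M
    events.sort()
    events.append((T, 0.0))
    target = eps * M
    if target <= 0:
        raise ValueError("need eps > 0 and a positive total aggregate M")
    breaks, t, I = [0.0], 0.0, 0.0
    for tm, dw in events:
        while True:
            dt = tm - t
            if I + 0.5 * W * dt * dt + V * dt < target:
                break                             # no breakpoint before tm
            c = I - target                        # solve W/2 x^2 + V x + c = 0
            if abs(W) < 1e-12:
                x = -c / V
            else:
                x = (-V + math.sqrt(max(0.0, V * V - 2.0 * W * c))) / W
            x = min(max(x, 0.0), dt)              # guard against rounding
            V += W * x
            t += x
            I = 0.0                               # reset the running sum
            breaks.append(t)
        dt = tm - t
        I += 0.5 * W * dt * dt + V * dt           # advance running sum to tm
        V += W * dt
        W += dw
        t = tm
    if T - breaks[-1] > 1e-9:
        breaks.append(T)                          # b_r = T
    else:
        breaks[-1] = T
    return breaks

print(breakpoints1([[(0, 1), (4, 1)]], 0.25))     # [0.0, 1.0, 2.0, 3.0, 4.0]
```

For the single constant object above, $M = 4$ and $\varepsilon M = 1$, so the breakpoints fall one time unit apart and there are $1/\varepsilon + 1 = 5$ of them.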

Baseline construction of BREAKPOINTS2. While construction of
BREAKPOINTS1 reduces to a simple scan over all segments (represented
as tuples), computing BREAKPOINTS2 is not as easy because of the
replacement of the sum operation with a max. The difficulties come in
resetting the maintained data at each breakpoint.

Again, we first need to preprocess all of the objects into individual
tuples, one for each line segment. We store the $\ell$th segment of
$o_i$ as the tuple $s_{i,\ell} = (t_L, t_R, v_L, v_R, i)$, which
stores the left and right endpoints of the segment in time as $t_L$
and $t_R$, respectively, and also stores the values it has at those
times as $v_L = g_i(t_L)$ and $v_R = g_i(t_R)$, respectively. Note
that for each segment $s_{i,\ell}$ we can compute its slope
$w_{i,\ell} = (v_R - v_L)/(t_R - t_L)$. Then we sort all tuples across
all objects according to $t_L$ in ascending order and place them in a
queue $Q$. The breakpoints $\mathcal{B}_2$ will be constructed by
popping elements from $Q$.

By starting with $b_0 = 0$, we only need to show how to compute
$b_{j+1}$ given $b_j$. We maintain a running integral
$I_i(t) = \sigma_i(b_j, t)$ for each object. Thus at the start of a
new breakpoint $b_j$, each integral is set to 0. Then for each new
segment $s_{i,\ell}$ that we pop from $Q$, we update $I_i(t)$ to
$I_i(t_R) = I_i(t) + (v_R + v_L)(t_R - t_L)/2$, i.e., we add the area
of the trapezoid under the segment as in (1). If
$I_i(t_R) < \varepsilon M$, then we pop the next tuple from $Q$ and
continue.

However, if the updated $I_i(t_R) \ge \varepsilon M$, then it means we
have an event before the next segment will be processed from $o_i$. As
before with BREAKPOINTS1, we calculate

$$b_{j+1,i} = t + \frac{-g_i(t) + \sqrt{(g_i(t))^2 - 2 w_{i,\ell}(I_i(t) - \varepsilon M)}}{w_{i,\ell}}.$$

This is not necessarily the location of the next breakpoint
$b_{j+1}$, but if the breakpoint is caused by $o_i$, then this will be
it. We call such objects, for which we have calculated $b_{j+1,i}$,
dangerous. We let $\hat{b}_{j+1} = \min_i b_{j+1,i}$ (where
$b_{j+1,i}$ is implicitly $\infty$ if $o_i$ is not dangerous). To
determine the true next breakpoint we keep popping tuples from $Q$
until for the current tuple $t_L \ge \hat{b}_{j+1}$. This indicates
that no more segment endpoints occur before some object $o_i$ reaches
$I_i(t) = \varepsilon M$. So we set $b_{j+1} = \hat{b}_{j+1}$, and
reset the maintained values in preparation for finding $b_{j+2}$.

Assuming $\Omega(m/B)$ internal memory space, this method runs in $O((N/B)\log_B N)$ IOs, as we can maintain $m$ running sums in memory. We can remove this assumption, still within $O((N/B)\log_B N)$ IOs, with some technical tricks whose details we omit for space. To summarize, after sorting in $O(\log_B N)$ passes on the data, we determine for each segment from each $o_i$ how many segments occur before another segment from $o_i$ is seen. We then keep the auxiliary information for each object (e.g., running sums) in an IO-efficient priority queue [5] on the objects, sorted by the order in which a segment from each object will next appear.

However, with limited internal space, or when counting internal runtime, this method is still potentially slower than finding BREAKPOINTS1, since it needs to reset each $I_i(b_{j+1}) = 0$ when we reach a new breakpoint. This becomes clear from an internal memory runtime perspective, where this method may take $O(rm + N\log N)$ time.

Efficient construction of BREAKPOINTS2. We can avoid the extra $O(rm)$ term in the run time by using clever bookkeeping that ensures we do not have to reset too much each time we find a breakpoint. The appendix in Section 9.1 of our technical report [10] shows:

Lemma 1 BREAKPOINTS2 can be built in $O(N\log N)$ time (for $N > 1/\varepsilon$). Its size is $r = O(1/\varepsilon)$, and it takes $O((N/B)\log_B N)$ IOs to construct.

Remarks. For specific datasets there may be other specialized 
ways of choosing breakpoints. For real world datasets, such as the 
MesoWest data as shown in Figure 1, our methods are both efficient 
and have excellent approximation quality (see Section 5). 

3.2 Index Breakpoints and Queries 

Given a set of breakpoints $B$ (either $B_1$ or $B_2$), we show how to answer queries on the full dataset approximately. The approximation guarantees are based on the following property, which holds for both BREAKPOINTS1 ($B_1$) and BREAKPOINTS2 ($B_2$). For any query interval $[t_1, t_2]$, let $[B(t_1), B(t_2)]$ be the associated approximate interval, where $B(t_1)$ (resp. $B(t_2)$) is the smallest breakpoint in $B$ such that $B(t_1) \ge t_1$ (resp. $B(t_2) \ge t_2$); see Figure 8.




Figure 8: Associated approximate interval.
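
The mapping from a query interval to its associated approximate interval can be sketched with a binary search over the sorted breakpoints. This is only an in-memory illustration (the paper's structures are disk-based); the function name `snap_interval` is hypothetical, and the code assumes $t_1 \le t_2$ and that the last breakpoint is at or after $t_2$.

```python
from bisect import bisect_left

def snap_interval(breakpoints, t1, t2):
    """Map [t1, t2] to (B(t1), B(t2)), the smallest breakpoints >= t1 and >= t2.

    `breakpoints` must be sorted in ascending order.
    """
    i = bisect_left(breakpoints, t1)  # first index with breakpoints[i] >= t1
    j = bisect_left(breakpoints, t2)  # first index with breakpoints[j] >= t2
    if j >= len(breakpoints):
        raise ValueError("t2 lies beyond the last breakpoint")
    return breakpoints[i], breakpoints[j]
```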

Lemma 2 For any query $[t_1, t_2]$ and associated approximate interval $[B(t_1), B(t_2)]$: $\forall o_i$, $|\sigma_i(t_1, t_2) - \sigma_i(B(t_1), B(t_2))| \le \varepsilon M$.

PROOF. Both $B_1$ and $B_2$ guarantee that between any two consecutive breakpoints $b_j, b_{j+1} \in B$, for any object, $\sigma_i(b_j, b_{j+1}) \le \varepsilon M$. This property is guaranteed directly for BREAKPOINTS2, and is implied by BREAKPOINTS1 because for any object $o_i$ it holds that $\sigma_i(t_1, t_2) \le \sum_{j=1}^{m} \sigma_j(t_1, t_2)$ whenever each $\sigma_j(t_1, t_2) \ge 0$, which is the case since we assume positive scores (this restriction is removed in Section 4).

Hence, by changing the query interval from $[t_1, t_2]$ to $[B(t_1), t_2]$ the aggregate can only decrease, and can decrease by at most $\varepsilon M$. Also, by changing the interval from $[B(t_1), t_2]$ to $[B(t_1), B(t_2)]$ the aggregate can only increase, and can increase by at most $\varepsilon M$. Since the two changes have opposite signs and each is bounded by $\varepsilon M$, the net change is at most $\varepsilon M$ in absolute value, so the inequality holds. □

We now present two query methods, and their associated data structures, called QUERY1 and QUERY2.

Nested B+-tree queries. For QUERY1 we consider all $O(r^2)$ intervals with a breakpoint from $B$ at each endpoint. For each of these intervals $[b_j, b_{j'}]$, we construct the $k_{\max}$ objects with the largest aggregate $\sigma_i(b_j, b_{j'})$. Now we can show that this nested B+-tree yields an $(\varepsilon, 1)$-approximation for both the aggregate scores and $A(k, t_1, t_2)$ for any $k \le k_{\max}$.

To construct the set of $k_{\max}$ objects associated with each interval $[b_j, b_{j'}]$ we use a single linear sweep over all segments using operations similar to EXACT1. Starting at each breakpoint $b_j$, we initiate a running integral for each object to represent the intervals with $b_j$ as their left endpoint. Then at each later breakpoint $b_{j'}$ we output the $k_{\max}$ objects with the largest running integrals starting at each $b_j$ up to $b_{j'}$ to represent $[b_j, b_{j'}]$. That is, we maintain $O(r)$ sets of $m$ running integrals, one for each left breakpoint $b_j$ we have seen so far. (To avoid too much internal space in processing all $N$ segments, we use a single IO-efficient priority queue as in constructing BREAKPOINTS2, where each of the $m$ objects in the queue now also stores $O(r)$ running sums.) We also maintain $O(r)$ priority queues of size $k_{\max}$, one for each left endpoint $b_j$, over each set of $m$ running integrals on different objects. This takes $O((N/B)(\log_B(mr) + r\log_B k_{\max}) + r(rk_{\max}/B + 1))$ IOs, where the last term accounts for the output size (since we have $O(r^2)$ intervals and each interval stores $k_{\max}$ objects). We assume $rk_{\max} < N$ (to simplify, and so that the index size $O(r^2 k_{\max})$ is feasible); hence, the last term is absorbed in $O(\cdot)$.

To index the set of these intervals, we use a nested set of B+-trees. We first build a B+-tree $T_{top}$ on the breakpoints $B$. Then for each leaf node associated with $b_j$, we point to another B+-tree $T_j$ on $B_j$, where $B_j = \{b \in B \mid b > b_j\}$. The top-level B+-tree $T_{top}$ indexes the left endpoint of an interval $[b_j, b_{j'}]$, and the lower-level B+-tree $T_j$ pointed to by $b_j$ in $T_{top}$ indexes the right endpoint $b_{j'}$ (for all $b_{j'} > b_j$). We build $O(r)$ B+-trees of size $O(r)$; hence, this step takes $O(r^2/B)$ IOs (by bulk loading). Again, we assume $r^2 < N$, and this cost will also be absorbed in the construction cost.

Now we can query any interval in $O(\log_B r)$ time, since each B+-tree requires $O(\log_B r)$ to query. For a query top-$k(t_1, t_2, \sigma)$, we use $T_{top}$ to find $B(t_1)$, and the associated lower-level B+-tree of $B(t_1)$ to find $B(t_2)$, which gives the top $k_{\max}$ objects in the interval $[B(t_1), B(t_2)]$. We return the top $k$ objects from them as $\tilde{A}$ (see Figure 9). The above and Lemma 2 imply the following results.



Figure 9: Illustration of QUERY1.
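
To make the QUERY1 idea concrete, here is a small in-memory Python sketch that precomputes the top-$k_{\max}$ list for every pair of breakpoints and answers a query with two binary searches. It swaps the nested B+-trees for a plain dictionary and assumes a hypothetical input `cum`, where `cum[o][j]` is the aggregate of object `o` from the first breakpoint up to breakpoint `j`; none of these names come from the paper.

```python
from bisect import bisect_left
import heapq

def build_query1(breakpoints, cum, kmax):
    """Precompute the kmax largest aggregates for every breakpoint pair (j, j')."""
    r = len(breakpoints)
    index = {}
    for j in range(r):
        for jp in range(j + 1, r):
            # Aggregate over [b_j, b_j'] is a difference of cumulative sums.
            scored = ((c[jp] - c[j], o) for o, c in cum.items())
            index[(j, jp)] = heapq.nlargest(kmax, scored)  # sorted, largest first
    return index

def query1(breakpoints, index, t1, t2, k):
    """Return up to k (score, object) pairs for the snapped interval."""
    j = bisect_left(breakpoints, t1)   # position of B(t1)
    jp = bisect_left(breakpoints, t2)  # position of B(t2)
    return index.get((j, jp), [])[:k]
```

The dictionary has one entry per breakpoint pair, which mirrors the quadratic dependence on $r$ in the index size discussed in the text.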

Lemma 3 Given breakpoints $B$ of size $r$ ($r^2 < N$ and $rk_{\max} < N$), QUERY1 takes $O((N/B)(\log_B(mr) + r\log_B k_{\max}))$ IOs to build, has size $\Theta(r^2 k_{\max}/B)$, and returns $(\varepsilon, 1)$-approximate top-$k$ queries, for any $k \le k_{\max}$, in $O(k/B + \log_B r)$ IOs.

Dyadic interval queries. QUERY1 provides very efficient queries, but requires $\Theta(r^2 k_{\max}/B)$ blocks of space, which for small values of $\varepsilon$ can be too large (as $r = O(1/\varepsilon)$ in both types of breakpoints). For arbitrarily small $\varepsilon$, it could be that $r^2 > N$. It also takes $\Omega(rN\log k_{\max})$ time to build. Thus, we present an alternative approximate query structure, called QUERY2, that uses only $O(rk_{\max}/B)$ space, still has efficient query times and high empirical accuracy, but has slightly worse accuracy guarantees. It is an $(\varepsilon, 2\log r)$-approximation for both $\sigma_i(t_1, t_2)$ and $A(k, t_1, t_2)$.

We consider all dyadic intervals, that is, all intervals $[b_j, b_{j'}]$ where $j = h2^{\ell} + 1$ and $j' = (h + 1)2^{\ell}$ for some integers $0 \le \ell \le \log r$ and $0 \le h \le r/2^{\ell} - 1$. Intuitively, these intervals represent the span of each node in a balanced binary tree. At each level $\ell$ the intervals are of length $2^{\ell}$, and there are $\lceil r/2^{\ell} \rceil$ intervals. There are fewer than $2r + \log r$ such intervals in total, since there are $r$ at level 0, $\lceil r/2 \rceil$ at level 1, and so on, geometrically decreasing.

As with QUERY1, for each dyadic interval $[b_j, b_{j'}]$ we find the $k_{\max}$ objects with the largest $\sigma_i(b_j, b_{j'})$ in a single sweep over all $N$ segments. There are $\log r$ active dyadic intervals at any time, one at each level, so we maintain $\log r$ running integrals per object. We do so again using two IO-efficient priority queues. The first requires $O((1/B)\log_B(m\log r))$ IOs per segment; its elements correspond to objects, sorted by which object has a segment to process next, and each element stores the $\log r$ associated running integrals. The second is a set of $\log r$ IO-efficient priority queues of size $k_{\max}$, sorted by the value of the running integral; each requires $O((1/B)\log_B k_{\max})$ IOs per segment. The total construction cost is $O((N/B)(\log_B(m\log r) + \log r\log_B k_{\max}))$ IOs.




Figure 10: Illustration of Query2. 

With dyadic intervals, any interval $[b_j, b_{j'}]$ can be formed as the disjoint union of at most $2\log r$ dyadic intervals. We use this fact as follows: for each query interval $[t_1, t_2]$ we determine the at most $2\log r$ dyadic intervals that decompose the associated approximate query interval $[B(t_1), B(t_2)]$. For each such dyadic interval, we retrieve the top-$k$ objects and scores from its associated top-$k_{\max}$ objects ($k \le k_{\max}$), and insert them into a candidate set $\mathcal{K}$, adding the scores of objects inserted more than once. The set $\mathcal{K}$ is of size at most $2k\log r$. We return the $k$ objects with the top $k$ summed aggregate scores from $\mathcal{K}$.
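
The candidate-merging step can be sketched in a few lines of Python. The function name `merge_candidates` and the input layout (one list of `(score, object)` pairs per dyadic interval, each sorted with the largest score first) are assumptions made for this illustration.

```python
from collections import defaultdict
import heapq

def merge_candidates(top_lists, k):
    """Sum per-interval scores of the top-k entries and return the k best totals."""
    totals = defaultdict(float)  # candidate set: object -> summed score
    for top in top_lists:
        for score, obj in top[:k]:  # only the top-k of each dyadic interval
            totals[obj] += score
    return heapq.nlargest(k, ((s, o) for o, s in totals.items()))
```

An object that falls outside the top $k$ in some dyadic interval contributes nothing for that interval, which is the source of the weaker $2\log r$ factor in the guarantee.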

Lemma 4 QUERY2 $(\varepsilon, 2\log r)$-approximates $A(k, t_1, t_2)$.

PROOF. Converting $[t_1, t_2]$ to $[B(t_1), B(t_2)]$ creates at most $\varepsilon M$ error between $\sigma_i(t_1, t_2)$ and $\sigma_i(B(t_1), B(t_2))$, as argued in Lemma 2. This describes the additive $\varepsilon M$ term in the error, and allows us to hereafter consider only the lower bound on the score over the approximate query interval $[B(t_1), B(t_2)]$.

The relative $2\log r$ factor is contributed by the decomposition of $[B(t_1), B(t_2)]$ into at most $2\log r$ disjoint intervals. For each object $o_i \in A(t_1, t_2)$, some such interval $[b_j, b_{j'}]$ must satisfy $\sigma_i(b_j, b_{j'}) \ge \sigma_i(B(t_1), B(t_2))/(2\log r)$. For this interval, if $o_i$ is in the top-$k$ then we return a value of at least $\sigma_i(b_j, b_{j'}) \ge \sigma_i(B(t_1), B(t_2))/(2\log r)$. If $o_i$ is not in the top-$k$ for $[b_j, b_{j'}]$, then each object $o_{i'}$ that is in that top-$k$ set has

$$\sigma_{i'}(B(t_1), B(t_2)) \ge \sigma_{i'}(b_j, b_{j'}) \ge \sigma_i(b_j, b_{j'}) \ge \frac{\sigma_i(B(t_1), B(t_2))}{2\log r}.$$

Thus, there must be at least $k$ objects $o_{i'} \in \tilde{A}(B(t_1), B(t_2))$ with $\tilde{\sigma}_{i'}(B(t_1), B(t_2)) \ge \sigma_i(B(t_1), B(t_2))/(2\log r)$. □

To efficiently construct the set $\mathcal{K}$ of at most $2k\log r$ potential objects to consider for $A(k, t_1, t_2)$, we build a balanced binary tree over $B$. Each node (either an internal node or a leaf node) corresponds to a dyadic interval (see Figure 10). We construct the set of such intervals that form the disjoint union over $[B(t_1), B(t_2)]$ as follows. In phase 1, starting at the root, if $[t_1, t_2]$ is completely contained within one child, we recurse to that child. Phase 2 begins when $[t_1, t_2]$ is split across both children of a node, so we recurse on each child. On the next step phase 3 begins; we describe the process for the left child, and the process is symmetric for the right child. If $t_1$ is within the right child, we recurse to that child. If $t_1$ is within the left child, we return the dyadic interval associated with the right child and recurse on the left child. Finally, if $t_1$ separates the left child from the right child, we return the dyadic interval associated with the right child and terminate. Since the height of the tree is at most $\log r$, and we return at most one dyadic interval at each level for the right and left case of phase 3, there are at most $2\log r$ dyadic intervals returned. The above idea can be easily generalized to a B+-tree (simply with larger fanout) if $r$ is large.
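
The decomposition can also be computed without an explicit tree. The Python sketch below is a bottom-up substitute for the top-down walk described above, not the paper's procedure: it greedily takes the largest aligned power-of-two block at each step. It works on half-open ranges of elementary slots (slot `s` standing for the gap between consecutive breakpoints), and the name `dyadic_decompose` is hypothetical.

```python
def dyadic_decompose(lo, hi):
    """Split the half-open slot range [lo, hi) into aligned dyadic blocks.

    Returns a list of (start, length) pairs where length is a power of two
    and start is a multiple of length.
    """
    blocks = []
    while lo < hi:
        # Largest power of two dividing lo (any power of two if lo == 0).
        size = lo & -lo if lo > 0 else 1 << (hi - lo).bit_length()
        while size > hi - lo:  # shrink until the block fits in the range
            size >>= 1
        blocks.append((lo, size))
        lo += size
    return blocks
```

Block sizes first grow and then shrink along the range, so at most two blocks of each size appear, in line with the $2\log r$ bound.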

Lemma 5 Given breakpoints $B$ of size $r$, QUERY2 requires size $\Theta(rk_{\max}/B)$, takes $O((N/B)(\log_B(m\log r) + \log r\log_B k_{\max}))$ IOs to build, and answers $(\varepsilon, 2\log r)$-approximate top-$k$ queries, for any $k \le k_{\max}$, in $O(k\log r\log_B k)$ IOs.

PROOF. The error bound follows from Lemma 4, and the construction time is argued above. The query time is dominated by maintaining a size-$k$ priority queue over the set $\mathcal{K}$ with $O(k\log r)$ objects inserted, from $k$ objects in each of $O(\log r)$ dyadic intervals. □

3.3 Combined Approximate Methods 

Finally we formalize the different approximate methods: APPX1-B, APPX2-B, APPX1, and APPX2. As shown in Figure 7, the methods vary based on how we combine the construction of breakpoints and the query structure on top of them. APPX1 and APPX2 use BREAKPOINTS2 followed by either QUERY1 or QUERY2, respectively. As we will demonstrate in Section 5, BREAKPOINTS2 is superior to BREAKPOINTS1 in practice; so, we designate APPX1-B (BREAKPOINTS1+QUERY1) the basic version of APPX1, and APPX2-B (BREAKPOINTS1+QUERY2) the basic version of APPX2.

The analyses of the basic and improved versions are largely similar; hence, we only list the improved versions in Table 3. In particular, for the results below, since $r = O(1/\varepsilon)$ in BREAKPOINTS1, we can replace $r$ with $1/\varepsilon$ for the basic results.

APPX1 computes $r = O(1/\varepsilon)$ breakpoints $B_2$ using BREAKPOINTS2 in $O((N/B)\log_B(N/B))$ IOs. Then QUERY1 requires $O(r^2 k_{\max}/B)$ space and $O((N/B)(\log_B(mr) + r\log_B k_{\max}))$ construction IOs, and can answer $(\varepsilon, 1)$-approximate queries in $O(k/B + \log_B r)$ IOs. Since $m, r < N$, this simplifies the total construction IOs to $O((N/B)(\log_B N + r\log_B k_{\max}))$, the index size to $O(r^2 k_{\max}/B)$, and the IOs for an $(\varepsilon, 1)$-approximate top-$k$ query to $O(k/B + \log_B r)$.

In APPX2, QUERY2 has $O(rk_{\max}/B)$ space, builds in $O((N/B)(\log_B(m\log r) + \log r\log_B k_{\max}))$ IOs, and answers $(\varepsilon, 2\log r)$-approximate queries in $O(k\log r\log_B k)$ IOs. As $m, r < N$, the bounds simplify to $O((N/B)(\log_B N + \log r\log_B k_{\max}))$ build cost, $O(k\log r\log_B k)$ query IOs, and $O(rk_{\max}/B)$ index size. We also consider a variant APPX2+, which discovers the exact aggregate value for each object in $\mathcal{K}$ using a B+-tree from EXACT2. This increases the index size by $O(N/B)$ (basically just storing the full data), and increases the query IOs to $O(k\log r\log_B k)$, but significantly improves the empirical query accuracy.

4. OTHER REMARKS 

Updates. In most applications, temporal data receive updates only at the current time instance, which extend a temporal object for some specified time period. In this case, we can model an update to an object $o_i$ as appending a new line segment $g_{i,n_i+1}$ to the end of $g_i$, where the left end-point of $g_{i,n_i+1}$ is $(t_{i,n_i}, v_{i,n_i})$ (the right end-point of $g_{i,n_i}$) and its right end-point is $(t_{i,n_i+1}, v_{i,n_i+1})$.

Handling updates in the exact methods is straightforward. In EXACT1, we insert a new entry $(t_{i,n_i}, g_{i,n_i+1})$ into the B+-tree; hence the update cost is $O(\log_B N)$ IOs. In EXACT2, we insert a new entry $(t_{i,n_i+1}, (g_{i,n_i+1}, \sigma_i(I_{i,n_i+1})))$ into the B+-tree $T_i$, where $I_{i,n_i+1} = [t_{i,0}, t_{i,n_i+1}]$. We can compute $\sigma_i(I_{i,n_i+1})$ based on $\sigma_i(I_{i,n_i})$ and $g_{i,n_i+1}$ in $O(1)$ cost, and $\sigma_i(I_{i,n_i})$ is retrieved from the last entry in $T_i$ in $O(\log_B n_i)$ IOs. So, the update cost is $O(\log_B n_i)$ IOs. In EXACT3, a new entry $([t_{i,n_i}, t_{i,n_i+1}], (g_{i,n_i+1}, \sigma_i(I_{i,n_i+1})))$ is inserted into the interval tree $S$. By similar arguments, $\sigma_i(I_{i,n_i})$ is retrieved from $S$ in $O(\log_B N)$ IOs, and then $\sigma_i(I_{i,n_i+1})$ is computed in $O(1)$. The insertion into $S$ takes $O(\log_B N)$ IOs [4]. Thus the total update cost is $O(\log_B N)$ IOs.

Handling updates in the approximate methods is more complicated. As such, we describe an amortized analysis for updates. This approach can be de-amortized using standard technical tricks. The construction of breakpoints depends on a threshold $\tau = \varepsilon M$; however, $M$ increases with updates. We handle this by always constructing breakpoints (and the index structures on top of them) using a fixed value of $\tau$, and when $M$ doubles, we rebuild the structures. For this to work, we assume that it takes $\Omega(N)$ segments before $M$ doubles; otherwise, a segment $\ell$ could have an aggregate of $M/2$, and one would have to rebuild the entire query structure immediately after seeing $\ell$. Thus in an amortized sense, we can amortize the construction time $C(N)$ over $\Omega(N)$ segments, and charge $O(C(N)/N)$ to the update time of a segment.
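
The rebuild-on-doubling policy can be sketched as follows. This is a minimal in-memory illustration under stated assumptions: the class name `DoublingRebuilder` and the `build` callback (taking the segments and the threshold $\tau$) are hypothetical, and each appended segment is assumed to come with its precomputed area.

```python
class DoublingRebuilder:
    """Rebuild an index whenever the total aggregate M has doubled."""

    def __init__(self, eps, build):
        self.eps = eps
        self.build = build        # hypothetical: build(segments, tau) -> index
        self.segments = []
        self.total = 0.0          # current M
        self.built_total = 0.0    # M at the time of the last (re)build
        self.index = None
        self.rebuilds = 0

    def append(self, segment, area):
        self.segments.append(segment)
        self.total += area
        # tau stays fixed between rebuilds; refresh it only when M has doubled.
        if self.index is None or self.total >= 2.0 * self.built_total:
            self.built_total = self.total
            self.index = self.build(self.segments, self.eps * self.total)
            self.rebuilds += 1
```

With unit-area segments the number of rebuilds grows logarithmically in the number of appends, which is what lets the construction cost be charged across many updates.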

We also need to maintain a query structure and a set of breakpoints on top of the segments just added. Adding the breakpoints can be done by maintaining the same IO-efficient data structures as in their initial construction, using $O(\frac{1}{B}\log_B N)$ IOs per segment. To maintain the query structures, we again maintain the same auxiliary variables and running integrals as in the construction. Again, assuming that there are $\Omega(N/r)$ segments between any pair of breakpoints, we can amortize the building of the query structures to the construction cost divided by $N$. The amortized reconstruction or incremental construction of the query structures dominates the cost. For APPX1 we need $O(\frac{1}{B}(\log_B N + r\log_B k_{\max}))$ IOs to update QUERY1. For APPX2 we need $O(\frac{1}{B}(\log_B N + \log r\log_B k_{\max}))$ IOs to update QUERY2.

General time series with arbitrary functions. In some time series data, objects are described by arbitrary functions $f$, instead of piecewise linear functions $g$. However, as we explained in Section 1, much effort has been devoted to approximating an arbitrary function $f$ using a piecewise linear function $g$ in general time series (see [17] and references therein). Furthermore, to understand the flexibility of our methods, it is important to observe that all of our methods also naturally work with any piecewise polynomial functions $p$: the only change is that we need to deal with polynomial curve segments, instead of linear line segments. In all our methods, this only affects how to compute $\sigma_i(I)$ of an interval $I$ that is a subinterval of the interval defined by the two end-points of a polynomial curve segment $p_{i,j}$ (the $j$th polynomial function in the $i$th object). But this can be easily fixed. Instead of using (1) based on a trapezoid, we simply compute it using the integral over $p_{i,j}$, i.e., $\sigma_i(I) = \int_{t \in I} p_{i,j}(t)\,dt$. Given that $p_{i,j}(t)$ is a polynomial function, this can be easily computed. That said, when one needs more precision in representing an arbitrary time series, one can either use more line segments in a piecewise linear representation or use a piecewise polynomial representation. All of our methods work in both cases.
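
As a small worked illustration of replacing the trapezoid with a polynomial integral, the Python sketch below integrates one polynomial piece over a subinterval using its antiderivative. The function name `poly_integral` and the coefficient layout (`coeffs[d]` multiplies $t^d$) are assumptions of this example.

```python
def poly_integral(coeffs, a, b):
    """Integrate sum(coeffs[d] * t**d) over [a, b] via the antiderivative."""
    def antiderivative(t):
        return sum(c * t ** (d + 1) / (d + 1) for d, c in enumerate(coeffs))
    return antiderivative(b) - antiderivative(a)
```

For a linear piece such as $1 + 2t$ on $[0, 2]$ this gives 6, the same value as the trapezoid area $(1 + 5)/2 \cdot 2$.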

Negative values. We have assumed positive score values so far. 
But this restriction can be easily removed. Clearly, it does not affect 
our exact methods at all. In the approximate methods, when com- 
puting the breakpoints (in either approach), we use the absolute 
values instead to define M and when searching for a breakpoint. 
We omit technical details due to the space constraint, but we can 
show that doing so will still guarantee the same approximations. 

Other aggregates. Our work focuses on the sum aggregation. This automatically implies support for the avg aggregation, and for many other aggregations that can be expressed as linear combinations of the sum (such as $F_2$, the 2nd frequency moment). However, ranking by some holistic aggregates is hard. An important one in this class is the quantile (the median is a special case of the quantile). We leave the question of how to rank large temporal data using some of the holistic aggregates (e.g., quantile) as an open problem.

5. EXPERIMENTS 

We design all of our algorithms to efficiently consider disk IOs; in particular, we implemented all our methods using the TPIE library in C++ [2]. This allows our methods to scale gracefully to massive data that does not fit in memory. All experiments were performed on a Linux machine with an Intel Core i7-2600 3.4GHz CPU, 8GB of memory, and a 1TB hard drive.

Datasets. We used two large real datasets. The first dataset is a temperature dataset, Temp, from the MesoWest project [8]. It contains temperature measurements from Jan 1997 to Oct 2011 from 26,383 distinct stations across the United States. There are almost $N = 2.6$ billion total readings from all stations, with an average of 98,425 readings per station. For our experiments, we preprocessed the Temp dataset to treat each year of readings from a distinct station as a distinct object. By aligning readings in this manner we can ask which $k$ stations had the highest aggregate temperatures in a (same) time interval amongst any of the recorded years. After preprocessing, Temp has $m = 145{,}628$ objects with an average number of readings per object of $n_{avg} = 17{,}833$. In each object, we connect all consecutive readings to obtain a piecewise-linear representation.

The second real dataset, Meme, was obtained from the Memetracker project. It tracks popular quotes and phrases which appear from various sources on the internet. The goal is to analyze how different quotes and phrases compete for coverage every day and how some quickly fade out of use while others persist for long periods of time. A record has 4 attributes: the URL of the website containing the memes, the time Memetracker observed the memes, a list of the observed memes, and the links accessible from the website. We preprocess the Meme dataset, converting each record to have a distinct 4-byte integer id to represent the URL, an 8-byte double to represent the time of the record, and an 8-byte double to represent the record's score. A record's score is the number of memes appearing on the website, i.e., the cardinality of the list of memes. After preprocessing, Meme has almost $m = 1.5$ million distinct objects (the distinct URLs) with $N = 100$ million total records, an average of $n_{avg} = 67$ records per object. For each object, we connect every two of its consecutive records in time (according to the date) to create a piecewise linear representation of its score.
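
The preprocessing into a piecewise linear representation, and the sum aggregate over a query interval, can be sketched in Python as below. The function names `to_segments` and `aggregate` are hypothetical, the tuple layout `(t_left, t_right, v_left, v_right)` is chosen for this example, and the code assumes the readings of one object have distinct timestamps.

```python
def to_segments(readings):
    """Connect consecutive (time, value) readings of one object into segments."""
    pts = sorted(readings)
    return [(t0, t1, v0, v1) for (t0, v0), (t1, v1) in zip(pts, pts[1:])]

def aggregate(segments, a, b):
    """Sum aggregate over [a, b]: area under the piecewise linear curve."""
    total = 0.0
    for t_left, t_right, v_left, v_right in segments:
        lo, hi = max(a, t_left), min(b, t_right)
        if lo >= hi:
            continue  # segment does not overlap the query interval
        slope = (v_right - v_left) / (t_right - t_left)
        v_lo = v_left + slope * (lo - t_left)
        v_hi = v_left + slope * (hi - t_left)
        total += (v_lo + v_hi) / 2.0 * (hi - lo)  # trapezoid area
    return total
```

This linear scan is only a reference implementation of the aggregate; the indexes in the paper exist precisely to avoid scanning every segment at query time.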

Setup. We use Temp as the default dataset. To test the impact of different variables, we have sampled subsets of Temp to create datasets with different numbers of objects ($m$) and different average numbers of line segments per object ($n_{avg}$, by limiting the maximum value $T$). By default, $m = 50{,}000$ and $n_{avg} = 1{,}000$ in Temp, so all exact methods can finish in a reasonable amount of time. Still, there are a total of $N = 50 \times 10^6$ line segments! The default values of other important variables in our experiments are: $k_{\max} = 200$, $k = 50$, $r = 500$ (the number of breakpoints in both BREAKPOINTS1 and BREAKPOINTS2), and $(t_2 - t_1) = 20\%\,T$. The disk block size in TPIE is set to 4KB. For each query-related result, we generated 100 random queries and report the average. Lastly, in all datasets, all line segments are sorted by the time value of their left end-point.

Number of breakpoints. We first investigate the effect of the number of breakpoints $r$ on the different approximate methods, by changing $r$ from 100 to 1000. Figure 11 shows the preprocessing results and Figure 12 shows the query results. Figure 11(a) indicates that given the same number of breakpoints, the value of the error parameter $\varepsilon$ using BREAKPOINTS2 ($B_2$) is much smaller than that using BREAKPOINTS1 ($B_1$) in practice; this confirms our theoretical analysis, since $r = 1/\varepsilon$ in $B_1$, but $r = O(1/\varepsilon)$ in $B_2$. This suggests that $B_2$ offers much higher accuracy than $B_1$ given the same budget $r$ on real datasets. With 500 breakpoints, $\varepsilon$ in $B_2$ reduces to almost $10^{-8}$, while it is still 0.02 in $B_1$. Figure 11(b) shows the build time of $B_1$ and $B_2$. Clearly, building $B_1$ is independent of $r$ since its cost is dominated by the linear sweep over all line segments. The baseline method for building $B_2$, BREAKPOINTS2-B, clearly has a linear dependency on $r$ (and on $m$ as well, which is not reflected by this experiment). However, our efficient method of building $B_2$, BREAKPOINTS2-E, has largely removed this dependency on $r$, as shown in Figure 11(b). It also removed the dependency on $m$, though this is not shown. In what follows, BREAKPOINTS2-E was used by default. Both $B_1$ and $B_2$ can be built fairly fast, in only 80 and 100 seconds respectively when $r = 500$ (over $50 \times 10^6$ segments!).

Next, we investigate the index size and the construction cost of the approximate methods, using EXACT3 as a reference (as it has the best query performance among all exact methods). Figure 11(c) shows that all approximate methods have much smaller size than EXACT3, except APPX2+, which also builds EXACT2 since it calculates the exact aggregate score for candidates in $\mathcal{K}$ from APPX2. Clearly, APPX1-B and APPX1 have the same size; the basic and improved versions differ only in which type of breakpoints they index using the two-level B+-trees. For the same reason, APPX2-B and APPX2 also have the same size; they index $B_1$ or $B_2$ using a binary tree over the dyadic intervals. APPX2-B and APPX2 only have size $O(rk_{\max})$, while APPX1-B and APPX1 have size $O(r^2 k_{\max})$, and EXACT3 and APPX2+ have linear size $O(N)$. This explains why the size of APPX2-B and APPX2 is more than 2 orders of magnitude smaller than the size of APPX1-B and APPX1, which are in turn 3 to 2 orders of magnitude smaller than EXACT3 and APPX2+ as $r$ changes from 100 to 1000. In fact, APPX2-B and APPX2 take only 1MB, and APPX1-B and APPX1 take only 100MB, when $r = 1000$, while EXACT3 and APPX2+ take more than 10GB.

Construction (building both the breakpoints and the subsequent query structures) is much faster for the approximate methods (including APPX2+) than for EXACT3, as shown in Figure 11(d). All structures build in only 100 to 1000 seconds. Not surprisingly, APPX2-B and APPX2 are the fastest, since they only need to find the top $k_{\max}$ objects for $O(r)$ intervals, while APPX1-B and APPX1 need to find the top $k_{\max}$ objects for $O(r^2)$ intervals. Even APPX2+ is significantly faster to build than EXACT3, since EXACT2 builds faster than EXACT3. All approximate methods are generally faster to build than EXACT3, by 1-2 orders of magnitude (except for APPX1 when $r$ reaches 1000), since the top $k_{\max}$ objects can be found in a linear sweep over all line segments as explained in Section 3.2.

In terms of the query performance, we first examine the approximation quality of all approximate methods, using both the standard precision/recall (between $A$ and $\tilde{A}$) and the average of the approximation ratios, defined as $\tilde{\sigma}_i(t_1, t_2)/\sigma_i(t_1, t_2)$ for any $o_i$ returned in $\tilde{A}$. Since $|A|$ and $|\tilde{A}|$ are both $k$, the precision and the recall have the same denominator and hence the same value. Figure 12(a) shows that all approximate methods have precision/recall higher than 90% even in the worst case when $r = 100$; in fact, APPX1 and APPX2+ have precision/recall close to 1 in all cases. Figure 12(b) further shows that APPX1, APPX1-B, and APPX2+ have approximation ratios on the aggregate scores very close to 1, whereas APPX2 and APPX2-B have approximation ratios within 5% of 1. In both figures, APPX1 and APPX2 using $B_2$ are indeed better than their basic versions APPX1-B and APPX2-B using $B_1$, since given the same number of breakpoints, $B_2$ results in much smaller $\varepsilon$ values (see Figure 11(a)). Similar results hold for APPX2+, and are omitted to avoid clutter. Nevertheless, all methods perform much better in practice than their theoretical error parameter $\varepsilon$ suggests (as it reflects a worst-case analysis). Not surprisingly, both types of approximation quality improve for all approximate methods when $r$ increases; but $r = 500$ already provides excellent quality.
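
The two quality measures can be computed as in the following Python sketch. The function names and the dictionary inputs (object id to score) are assumptions of this illustration; they are not taken from the experimental code.

```python
def precision_recall(approx_ids, exact_ids):
    """Precision and recall of an approximate top-k answer against the exact one."""
    hits = len(set(approx_ids) & set(exact_ids))
    return hits / len(approx_ids), hits / len(exact_ids)

def mean_ratio(approx_scores, exact_scores):
    """Average of approximate/exact aggregate scores over the returned objects."""
    ratios = [approx_scores[o] / exact_scores[o] for o in approx_scores]
    return sum(ratios) / len(ratios)
```

When both answers contain exactly $k$ objects the two returned values coincide, which is why the experiments report a single precision/recall number.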

Finally, in terms of query cost, the approximate methods are clear winners over the best exact method EXACT3, with better IOs in Figure 12(c) and query time in Figure 12(d). In particular, APPX1-B and APPX1 (resp. APPX2-B and APPX2) have the same IOs given the same $r$ values, since they have identical index structures and differ only in the values of the entries they index. These four methods have the smallest number of IOs among all methods, in particular 6-8 IOs in all cases. All require only two queries in a B+-tree of size $r$: a top-level and a lower-level tree for APPX1 and APPX1-B, and a left- and a right-endpoint query for APPX2 and APPX2-B. APPX2+ is slower, with about 100 to 150 IOs in all cases, because after identifying the candidate set $\mathcal{K}$ it needs to verify the exact score of each candidate. But, since it only needs to deal with $2k\log r$ candidates in the worst case, and in practice $|\mathcal{K}| \ll 2k\log r$, its IOs are still very small. In contrast, the best exact method EXACT3 takes more than 1000 IOs.

Smaller IO costs lead to much better query performance; all approximate methods outperform the best exact method EXACT3 by at least 2 orders of magnitude in Figure 12(d). In particular, they generally take less than 0.01 seconds to answer a top-50$(t_1, t_2, \mathrm{sum})$ query, with a 20% time span over the entire temporal domain, over $50 \times 10^6$ line segments from 50,000 objects, while the best exact method EXACT3 takes around 1 second for the same query. The fastest approximate method takes only close to 0.001 second!

From these results, clearly, APPX1 and APPX2 using $B_2$ are better than their corresponding basic versions APPX1-B and APPX2-B using $B_1$, given the same number of breakpoints; and $r = 500$ already gives excellent approximation quality (the same holds for APPX2+, which we omit to avoid clutter). As such, we only use APPX1, APPX2, and APPX2+ for the remaining experiments with $r = 500$. Among the three, APPX2+ is larger and slower to build than APPX1, followed by APPX2; the fastest to query are APPX1 and APPX2, then APPX2+; but APPX1 and APPX2+ have better approximation quality than APPX2 (as shown in later experiments and as suggested by the theoretical guarantees for APPX1).

Figure 13: Vary number of objects $m$ on Temp. Panels: (a) index size, (b) build time, (c) query I/Os, (d) query time. X-axis: objects $m$ ($\times 10^3$). Series: EXACT1, EXACT2, EXACT3, APPX1, APPX2, APPX2+.

Figure 14: Vary average number of segments $n_{avg}$ on Temp. Panels: (a) index size, (b) build time, (c) query I/Os, (d) query time. X-axis: average segments $n_{avg}$ ($\times 10^2$). Series: EXACT1, EXACT2, EXACT3, APPX1, APPX2, APPX2+.



Scalability. Next, we investigate the scalability of the different methods, using all three exact methods and the three selected approximate methods, when we vary the number of objects $m$ and the average number of line segments per object $n_{avg}$ in the Temp dataset. Figures 13, 14, and 15 show the results. In general, the trends are very consistent and agree with our theoretical analysis. All exact methods consume linear space $O(N)$ and take $O(N\log N)$ time to build. EXACT3 is clearly the overall best exact method in terms of query costs, outperforming the other two by 2-3 orders of magnitude in terms of IOs and query time (even though it costs slightly more to build). In general, EXACT3 takes hundreds to a few thousand IOs, and about 1 to a few seconds, to answer an aggregate top-$k(t_1, t_2, \mathrm{sum})$ query in the Temp dataset (with a few hundred million segments from 145,628 objects). Its query performance is not clearly affected by $n_{avg}$, but has a linear dependency on $m$.

The approximate methods consistently beat the best exact algorithm
in query performance by more than 2 orders of magnitude
in terms of running time. Even on the largest dataset, with a few
hundred million segments from 145,628 different objects, they still
take less than 0.01 seconds per query! Among the three, Appx1
and Appx2 clearly take fewer I/Os, since their query cost is actually
independent of both m and n_avg! Appx2+'s query I/O does depend
on log n_avg, but is independent of m; hence, it is still very small.
Appx1 (and even more so Appx2+) occupies much more space and
takes much longer to build than Appx2. Nevertheless, both Appx1 and
Appx2 have a much smaller index size than Exact3, by 4 (Appx1) and 6
(Appx2) orders of magnitude respectively. More importantly, their
index size is independent of both m and n_avg! In terms of the
construction cost, Appx2-B is the most efficient to build (1-2 orders
of magnitude faster than all other methods except Appx2).

Figure 15: m and n_avg vs. approximation quality for Temp.

Figure 15 shows that both Appx1 and Appx2+ retain their high
approximation quality when m or n_avg varies; despite some
fluctuation, precision/recall and approximation ratios for both Appx1
and Appx2+ stay very close to 1. Appx2 remains at an acceptable
level of accuracy, especially considering that the index size is 1MB
from 50GB of data! Although the precision/recall drops as n_avg
and m increase, the very accurate approximation ratio indicates
that this is because there are many very similar objects.
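
The following small sketch illustrates how precision/recall can drop while the approximation ratio stays near 1 when objects have nearly identical scores. The metric definitions are assumptions made for illustration: precision and recall are computed on the returned object sets, and the ratio is taken as the mean, over ranks j, of the j-th returned score divided by the j-th exact score. The object names and scores are invented.

```python
def precision_recall(approx_ids, exact_ids):
    """Set-based precision and recall of an approximate top-k answer."""
    hits = len(set(approx_ids) & set(exact_ids))
    return hits / len(approx_ids), hits / len(exact_ids)

def mean_ratio(approx_scores, exact_scores):
    """Mean rank-wise ratio of returned score to exact score (assumed metric)."""
    ratios = [a / e for a, e in zip(approx_scores, exact_scores)]
    return sum(ratios) / len(ratios)

# Invented example: objects "b" and "c" have almost the same aggregate score.
exact = [("a", 100.0), ("b", 99.9)]    # true top-2
approx = [("a", 100.0), ("c", 99.8)]   # returned top-2; "c" displaced "b"

p, r = precision_recall([o for o, _ in approx], [o for o, _ in exact])
ratio = mean_ratio([s for _, s in approx], [s for _, s in exact])
```

Here half of the returned objects are "wrong" (p = r = 0.5), yet the returned scores are within about 0.1% of the exact ones, so the answer is still useful when near-ties are common.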



Query time interval. Based on our cost analysis, the length of the
query time interval clearly does not affect the query performance of
most of our methods, except for Exact1, which has a linear
dependency on (t_2 - t_1) since it has to scan more line segments. In
Figures 16(a) and 16(b), Exact1 shows a linear increase in
both I/Os and running time (note the log scale of the plots), and
even for small query intervals (2% of T) it is still much slower than
Exact3 and the approximate methods.

In Figures 16(c) and 16(d) we analyze the quality of all
approximation techniques as the query interval increases. Appx1 and
Appx2+ clearly have the best precision/recall and approximation
ratio, with precision/recall above 99% and a ratio very close to 1
in all cases. Appx2 shows a slight decline in precision/recall, from
roughly 98% to above 90%, as (t_2 - t_1) increases from
2% to 50% of the maximum temporal value T. This decrease in
precision/recall is reasonable, since as we increase (t_2 - t_1), the
number of dyadic intervals that compose the approximate query
interval [B(t_1), B(t_2)] typically increases. As the number of dyadic
intervals increases, there is an increased probability that not every
candidate in the candidate set K will be in the top-k_max over each of
the dyadic intervals, and so Appx2 will be missing some of a
candidate's aggregate scores. This can cause an item to be falsely
ejected from the top k. The effect of missing aggregate scores is
clearly seen in Figure 16(d), which shows that Appx2's approximation
ratio drops slightly as the time range increases.
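
The effect described above can be sketched in a simplified model, which is an illustration under stated assumptions rather than the Appx2 implementation: breakpoint gaps are indexed by integers, a query range [lo, hi) of gaps is covered by maximal aligned dyadic blocks, and each block stores only a truncated top-k_max list of partial scores. The helper names and the toy lists are hypothetical.

```python
def dyadic_cover(lo, hi):
    """Split the index range [lo, hi) into maximal aligned dyadic blocks."""
    blocks = []
    while lo < hi:
        if lo == 0:
            size = 1 << (hi - lo).bit_length()  # any power of two is aligned at 0
        else:
            size = lo & -lo                     # largest power of two dividing lo
        while size > hi - lo:                   # shrink until the block fits
            size >>= 1
        blocks.append((lo, lo + size))
        lo += size
    return blocks

def merge_candidates(block_lists, k):
    """Sum stored partial scores per candidate. A candidate that is absent
    from a block's truncated list contributes nothing for that block."""
    totals = {}
    for top in block_lists:
        for oid, score in top.items():
            totals[oid] = totals.get(oid, 0.0) + score
    return sorted(totals.items(), key=lambda kv: -kv[1])[:k]

# A longer range generally needs more dyadic blocks.
short_cover = dyadic_cover(3, 4)   # one block
long_cover = dyadic_cover(1, 8)    # three blocks

# Toy lists with k_max = 2 (assumed). In the second block, "a" truly scores
# 5.0 but is not stored because "b" (6.0) and "c" (5.5) rank above it.
block_lists = [{"a": 10.0, "b": 8.0}, {"b": 6.0, "c": 5.5}]
top1 = merge_candidates(block_lists, 1)
```

In this toy case the true totals are a = 15.0 and b = 14.0, but the merged totals are a = 10.0 and b = 14.0, so "a" is falsely ejected from the top 1. The more blocks a query spans, the more chances there are for such a miss.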

k and k_max. We studied the effect of k and k_max; the results are
shown in Figures 17 and 18. Figures 17(a) and 17(b) show that the
query performance of most methods is not affected by the value of
k when it changes from 10 to k_max = 200 (a relatively small to
moderate change w.r.t. the database size), except for Appx2 and
Appx2+. This is because larger k values lead to more candidates
in K, which results in a higher query cost. Nevertheless, they still
have better I/Os than the best exact method Exact3, and much
better query cost (still 2 orders of magnitude improvement in the
worst case, which can be attributed to the caching effect of the
OS). Figures 17(c) and 17(d) show some fluctuation, but no clear
trend in accuracy due to variation in k.

We vary k_max from 50 to 500 in Figure 18. k_max obviously
has no effect on the exact methods. It linearly affects the
construction cost and the index size for Appx1 and Appx2, but they
are still much better than the exact methods even when k_max = 500.
In terms of query cost, given the same k values, k_max does not
clearly affect any approximate method when it only changes moderately
w.r.t. the database size.

Updates. As suggested by the cost analysis, the update time for
each index structure is roughly proportional to the build time
divided by the number of segments. Relative to these build times
over N, however, Exact1 is slower because it cannot bulk load,
and Exact2 and Appx2+ are faster because they only update a
single B+-tree. Due to space constraints, we omit these results.

[Figure panels: (c) I/Os; (d) Query time. Caption (truncated): ... dataset evaluation.]

Meme dataset. We have also tested all our methods on the full
Meme dataset (still using r = 500 breakpoints for all approximate
methods), and the results are shown in Figure 19. In terms of the
index size, the three exact methods (and Appx2+) are comparable, as
seen in Figure 19(a), while the other approximate methods take much
less space, by 3-5 orders of magnitude! In terms of the construction
cost, it is interesting to note that Exact1 is the fastest to build in
this case, due to the bulk-loading algorithm in the B+-tree (since
all segments are sorted), while all other methods have some
dependency on m. But the approximate methods (excluding Appx2+) are
generally much faster to build than the other exact methods, as seen
in Figure 19(b). They also outperform all exact methods by 3-5 orders
of magnitude in I/Os in Figure 19(c) and 3-4 orders of magnitude in
running time in Figure 19(d). The best exact method for queries is
still Exact3, which is faster than the other two exact methods by
1-2 orders of magnitude. Finally, all approximate methods maintain
their high (or acceptable for Appx2) approximation quality on
this very bursty dataset, as seen in Figure 20. Note Appx2 achieves
this 90% precision/recall and close to 1 approximation ratio while
[...] show better results than their basic versions Appx1-B and
Appx2-B using B_1, given the same number of breakpoints, which agrees
with the trend from the Temp dataset.

Figure 20: Quality of approximations on Meme.

6. RELATED WORK 

To the best of our knowledge, ranking temporal data based on
their aggregation scores in a query interval has not been studied
before. Ranking temporal data based on the instant top-k definition
has been recently studied in [15]; however, as we have pointed out
in Section 1, one cannot apply their results in our setting. Another
work on ranking temporal data [14] retrieves the k objects that are
always among the top-k list at every time instance over a query
time interval. Clearly, this definition is very restrictive, and there
may not even be k objects satisfying this condition in a query
interval. This could be relaxed to require an object to be in the
top-k list at most time instances of an interval, instead of at all
time instances, like the intuition used in finding durable top-k
documents [20], but this has yet to be studied for time
series/temporal data. Even then, ranking by aggregation scores still
offers quite different semantics, is new, and is useful in numerous
applications.

Our study is related to work on temporal aggregation [22, 21]. As
mentioned in Section 2, [22, 21] focus on multi-versioned keys
(instead of time series data), and their objective is to compute a
single aggregation of all keys alive in a query time interval and/or
a query key range. This differs from our definition of aggregation,
which is to compute an aggregation over a query time interval, one
per object (and then rank objects based on their aggregation values).

Approximate versions of [22, 21] were presented by Tao et al. [18,
19], which also leveraged a discretization approach (the general
principle behind the construction of our breakpoints). As their goal
is to approximate aggregates over all keys alive in any query
rectangle over the time and key dimensions (a single aggregate
per query rectangle), instead of time-aggregates over each element
individually, their approach is not appropriate for our setting.

Our methods require the segmentation of time series data, which
has been extensively studied, and the general principles appear in
Section 1. A more detailed discussion of this topic is beyond the
scope of this work, and we refer interested readers to
[17, 12, 16, 6, 1].

7. CONCLUSION 

We have presented a comprehensive study on ranking large temporal
data using aggregate scores of temporal objects over a query
interval, which has numerous applications. Our best exact method,
Exact3, is much more efficient than the baseline methods, and our
approximate methods offer further improvements. Interesting open
problems include ranking with holistic aggregations (e.g., median
and quantiles) and extending to the distributed setting.
10. APPENDIX

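Lemma 6 An algorithm G that satisfies Definition 1 implies an
algorithm R that satisfies Definition 2.

PROOF. G creates $\tilde{A}(k, t_1, t_2)$ by finding the top k objects
and approximate scores ranked by $\tilde{\sigma}_i(t_1, t_2)$. By the
definition of G, $\tilde{\sigma}_{\tilde{A}(j)}(t_1, t_2)$ is an
$(\varepsilon, \alpha)$-approximation of
$\sigma_{\tilde{A}(j)}(t_1, t_2)$. To see that
$\tilde{\sigma}_{\tilde{A}(j)}(t_1, t_2)$ is an
$(\varepsilon, \alpha)$-approximation of $\sigma_{A(j)}(t_1, t_2)$,
note that all j objects $A(j')$ for $j' \in [0, j]$ satisfy

$$\tilde{\sigma}_{A(j')}(t_1, t_2) \ge \sigma_{A(j')}(t_1, t_2)/\alpha - \varepsilon M \ge \sigma_{A(j)}(t_1, t_2)/\alpha - \varepsilon M,$$

so $\tilde{\sigma}_{\tilde{A}(j)}(t_1, t_2)$ is at least as large as
this lower bound. There must be $m - j - 1$ objects i with
$\tilde{\sigma}_i(t_1, t_2) \le \sigma_{A(j)}(t_1, t_2) + \varepsilon M$,
implying
$\tilde{\sigma}_{\tilde{A}(j)}(t_1, t_2) \le \sigma_{A(j)}(t_1, t_2) + \varepsilon M$. □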
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